Introduction

Most people spend a large part of their adult lives working in paid jobs, so the quality of employment is closely associated with individual well-being and life satisfaction. For organizations, creating high-quality jobs tends to be costly, but the initial investment can eventually pay off through higher productivity. Identifying the core components of job quality and improving working conditions is therefore of interest to both parties: employers and employees.

With that in mind, this project has three main objectives:

Key components of a good job

A good job should consist of various aspects of working conditions that bring positive improvements to employees. According to Holman (2013), there are five main aspects of job quality:

Personality and socio-demographics

The notion of a "good" or "bad" job may vary from person to person, depending on their expectations and standards. For instance, young employees may expect a job with more autonomy, a flexible environment, and many opportunities to learn, while older workers may place more importance on shorter working hours and greater security.

Two relevant proxies can help quantify individuals' expectations at work:

Data overview and transformation

The data in this research were extracted from SOEP-CORE.v36, a longitudinal study conducted by the German Institute for Economic Research (DIW Berlin) that has involved more than 30,000 individuals and 15,000 households in Germany since 1984. The original package contains 15.5 GB of data in 59 datasets on various topics.

After completing initial research, I identified 4 datasets with relevant data for the project:

Import data

All datasets are in the .dta format, so I use read_stata to import them. For the pl and big5 datasets, because the original file is very large (almost 5GB), I used Stata to pre-select the necessary variables and export them to a .csv file for quicker importing.

Though the SOEP contains annual data, I restrict the sample to the year 2001 because of the availability of information about job satisfaction. The only exception is the data on personality traits, which are available only from 2005 onward. Researchers have shown that personality is relatively stable across the life span, so using the later personality measurements is a reasonable approximation.
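The import-and-filter step can be sketched as follows. This is only an illustration, assuming pandas is the library behind read_stata; the toy file, the column names (`syear`, `job_sat`, `unused`) and the values are hypothetical stand-ins, not the actual SOEP variables.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for one SOEP file; column names and values are hypothetical.
toy = pd.DataFrame({
    "pid": [1, 1, 2, 2],
    "syear": [2000, 2001, 2001, 2002],
    "job_sat": [7, 8, 6, 9],
    "unused": [0, 0, 0, 0],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.dta")
    toy.to_stata(path, write_index=False)
    # Read only the columns that are needed instead of the whole file.
    raw = pd.read_stata(path, columns=["pid", "syear", "job_sat"])

# Keep the 2001 wave only.
sample_2001 = raw[raw["syear"] == 2001].reset_index(drop=True)
```

Selecting columns at read time is one way to keep memory use down for large files; pre-selecting variables in Stata, as described above, achieves the same goal.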

Merge data

Once all datasets are imported, they are combined into a single dataframe. pgen is merged with hgen on the hid key (household ID), and then with the other datasets on the pid key (personal ID). After merging, I drop unnecessary variables and rename columns for easier interpretation.
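A minimal sketch of this merge sequence, assuming pandas dataframes. Only the dataset names (pgen, hgen, big5) and the keys (hid, pid) come from the text; the other column names, the values, and the choice of inner joins are made up for illustration.

```python
import pandas as pd

# Toy tables; non-key column names and all values are hypothetical.
pgen = pd.DataFrame({"pid": [101, 102, 201],
                     "hid": [1, 1, 2],
                     "net_income": [1500, 1200, 1800]})
hgen = pd.DataFrame({"hid": [1, 2], "hh_income": [2700, 2500]})
big5 = pd.DataFrame({"pid": [101, 102, 201], "worry": [3, 5, 4]})

merged = (
    pgen.merge(hgen, on="hid", how="inner")      # household-level data via hid
        .merge(big5, on="pid", how="inner")      # person-level data via pid
        .drop(columns=["hid"])                   # drop a variable no longer needed
        .rename(columns={"net_income": "income"})  # easier-to-read column name
)
```

With inner joins, only individuals present in every table survive the merge, which is one reason the number of observations can shrink at this stage.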

Transform data

Let's take a look at the combined dataset:

The good_job dataset contains 16,748 observations and 35 columns. Since the data come from surveys in raw format, it is necessary to clean and transform them before analysis.

Descriptive statistics

The final sample for analysis consists of 7,517 individuals with 35 variables in total. The table below gives descriptive statistics for all variables.

The dependent variable in the analysis is job satisfaction, which is measured on a scale from 0 (totally dissatisfied) to 10 (totally satisfied). On average, individuals are fairly satisfied with their job (mean = 7.15). The other variables are divided into 3 groups for further analysis:

Socio-demographic (birthyear to part_exp) and personality (efficient to worry) features

The sample consists of roughly equal proportions of male and female participants who are in the prime phase of their working lives, with an average age of 41, and who are well trained or educated (12.2 years on average). A fraction of employees with a migration background is included, which reflects the increasing importance of migrants to the German economy and society. The average monthly household income is about €2563, slightly below the national level of €2646 in 2001.

Nine personality variables associated with the Big Five model are included: efficient and thorough (conscientiousness), forgive and friendly (agreeableness), stress and worry (neuroticism), reserved and sociable (extraversion), and imagination (openness to experience). Respondents rated their own temperaments on a scale from 1 (do not agree) to 7 (completely agree). Based on the descriptive statistics, individuals in the sample score highly on most traits, especially thorough, efficient, friendly, forgive, and sociable.

Objective job features (full_time to work_time)

The data include 6 objective aspects of employment that are not affected by the respondents’ opinions or feelings. Overall, most people have a full-time job with a permanent contract and work 38.53 hours plus 2.21 overtime hours per week on average. The job_status variable captures how the general public perceives the prestige of an occupation. In this project, it is represented by the magnitude prestige scale (MPS) developed by Wegener (1988), which ranges from 30 (low prestige) to 216 (high prestige). Lastly, the monthly net income is around €1373 on average.

Subjective job features (accident to varied_job)

Subjective job features are coded as dummy variables with 0 (do not agree) and 1 (agree). Overall, most respondents have relatively good working conditions, with a very high level of task diversity (0.94), good relationships with colleagues (0.97), a high degree of autonomy at work (0.83), and many opportunities to learn (0.81). In contrast, their jobs involve high levels of stress (0.79), and a majority are strictly monitored (0.62). The manual_labor variable indicates that blue-collar and white-collar employees are roughly balanced. As for physical conditions, most people do not have to work in hazardous environments or in jobs with a high risk of accidents.

What affects job satisfaction?

To identify which factors determine job quality, the following multiple regression model is used: $Job\_satisfaction_i = \beta X_i + u_i$, where $X_i$ is the vector of explanatory variables for individual $i$ and $u_i$ is the error term. Estimated by ordinary least squares (OLS), this model helps determine which variables are statistically significant and the signs of their effects.
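To make the estimation step concrete, here is a small sketch of OLS with classical standard errors using NumPy on synthetic data. The regressors, coefficients, and the explicit intercept column are assumptions for illustration only; they are not the project's variables or results.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))               # three made-up regressors
beta_true = np.array([0.5, -0.3, 0.0])    # made-up coefficients
y = 7.0 + X @ beta_true + rng.normal(scale=1.0, size=n)

# Add an intercept column and solve the least-squares problem.
X1 = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)

# Classical (homoskedastic) standard errors and t statistics.
resid = y - X1 @ beta_hat
dof = n - X1.shape[1]
sigma2 = resid @ resid / dof
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X1.T @ X1)))
t_stats = beta_hat / se
```

A coefficient is usually called statistically significant at the 5% level when the absolute value of its t statistic exceeds roughly 1.96, and its sign gives the direction of the association.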

Results

Overall, 21 out of 34 variables are statistically significant for job satisfaction.

The level of income and the number of working hours are both statistically significant. Higher salaries make employees markedly more content, while longer working time makes them less satisfied. Also, people with permanent contracts are more satisfied at work than those who are employed temporarily. Three other objective job features, namely the total number of overtime hours, job status, and full-time positions, do not show a statistically significant association with individual satisfaction at work.

Many subjective job features are statistically significant. Employees are likely to feel more satisfied when they have good relationships with colleagues, few conflicts with superiors, more opportunities to learn, and more variety in their assigned tasks. In contrast, too much stress at work and hazardous working conditions are negatively associated with job satisfaction. The other variables in the group are statistically insignificant, including high levels of manual tasks, special working shifts, and high risks of accidents.

The results also show that many socio-demographic features are associated with job satisfaction. Gender is a key determinant: the estimates indicate that male workers are more satisfied than their female counterparts. Concerning monthly net household income, coming from a more affluent household tends to go with higher satisfaction at work. Another interesting finding is that people with a migration background are more satisfied than native German employees. In contrast, more years of education and training reduce job satisfaction, which may be explained by higher expectations or aspirations for their careers. Surprisingly, there is no statistically significant difference between age groups.

Most personality variables are statistically significant. Both neuroticism features have a strong effect on the dependent variable: people who cannot deal well with stress and who worry frequently tend to feel more dissatisfied at work. The same holds for conscientiousness: job satisfaction is higher among workers who work thoroughly and efficiently. Concerning agreeableness, only forgiveness has a statistically significant effect, while friendliness is not a key determinant of job satisfaction. Openness to experience (represented by a lively imagination) is statistically significant for individual satisfaction at work. Neither extraversion variable, reserved or sociable, is statistically significant for job satisfaction.

Canonical correlation analysis

Theory

Canonical correlation analysis (CCA) is a statistical technique that helps determine possible relationships between two sets of variables.

Consider two sets of variables: $X = (X_1, \dots, X_p)$ and $Y = (Y_1, \dots, Y_q)$. The simplest way to examine their relationships is to look at the correlation matrix between them. However, when p and q are large, it results in the following graph, which can be difficult and time-consuming to interpret.

CCA helps resolve this problem by examining the inter-correlations between the two sets as a whole rather than the correlation between each pair of variables across the two sets. The objective is to obtain a smaller number of derived canonical variables that capture the strongest correlations between the two groups.

To implement CCA, let $U_m = a_{m1}X_1 + \dots + a_{mp}X_p$ and $V_m = b_{m1}Y_1 + \dots + b_{mq}Y_q$ be linear combinations of all variables in sets X and Y, respectively.

We need to find $U_1 = a_{11}X_1 + \dots + a_{1p}X_p$ and $V_1 = b_{11}Y_1 + \dots + b_{1q}Y_q$ such that the correlation between them, $\rho_1 = corr(U_1, V_1)$, is the largest possible across all combination weights for $U_1$ and $V_1$. $U_1$ and $V_1$ are the first pair of canonical variables, and $\rho_1$ is called the first canonical correlation. Each subsequent pair maximizes the same correlation while being uncorrelated with all earlier pairs. We continue the process until a total of k = min(p, q) pairs of canonical variables and corresponding correlations are obtained.

In general, we need to find vectors $a$ and $b$ that maximize $\rho(a,b)$, with

$\rho(a,b) = \frac{a'\Sigma_{12}b}{(a'\Sigma_{11}a)^{1/2}(b'\Sigma_{22}b)^{1/2}}$

where $\Sigma_{11}$ and $\Sigma_{22}$ are the covariance matrices of X and Y, and $\Sigma_{12}$ is the cross-covariance matrix between X and Y. This maximization problem can be solved by applying the Lagrange multiplier method.
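The maximization can also be carried out numerically. The sketch below is one standard route, not necessarily the one used in this project: it whitens both sets with the inverse square roots of their sample covariance matrices and takes a singular value decomposition of the whitened cross-covariance. The function name `cca` and the synthetic data are assumptions for illustration.

```python
import numpy as np

def cca(X, Y):
    """Return canonical correlations and weight matrices A, B (one column per pair)."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0]
    S11 = Xc.T @ Xc / (n - 1)      # sample covariance of X
    S22 = Yc.T @ Yc / (n - 1)      # sample covariance of Y
    S12 = Xc.T @ Yc / (n - 1)      # sample cross-covariance of X and Y

    def inv_sqrt(S):
        # Inverse matrix square root via the eigendecomposition of a symmetric matrix.
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T

    R1, R2 = inv_sqrt(S11), inv_sqrt(S22)
    U, rho, Vt = np.linalg.svd(R1 @ S12 @ R2, full_matrices=False)
    return rho, R1 @ U, R2 @ Vt.T

# Synthetic example: one shared latent factor links the first columns of X and Y.
rng = np.random.default_rng(1)
n = 2000
z = rng.normal(size=n)
X = np.column_stack([z + rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)])
Y = np.column_stack([z + rng.normal(size=n), rng.normal(size=n)])

rho, A, B = cca(X, Y)
U1 = (X - X.mean(axis=0)) @ A[:, 0]   # first canonical variable of X
V1 = (Y - Y.mean(axis=0)) @ B[:, 0]   # first canonical variable of Y
```

Here `rho` has min(p, q) = 2 entries in decreasing order, and the sample correlation between `U1` and `V1` equals `rho[0]`. The columns of `A` and `B` play the role of the weight vectors $a$ and $b$.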

There are two ways to interpret CCA results:

To avoid bias due to scale differences, all data should be standardized to have a mean of zero and a standard deviation of one.
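As a small illustration of this standardization step, the sketch below computes column-wise z-scores with NumPy. The two columns are made-up values on very different scales, and the use of the sample standard deviation (`ddof=1`) is an assumption.

```python
import numpy as np

def standardize(X):
    """Column-wise z-scores: subtract the mean, divide by the sample standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Two made-up columns on very different scales (e.g. euros and weekly hours).
X = np.array([[1200.0, 38.0],
              [1500.0, 40.0],
              [900.0, 20.0],
              [2100.0, 45.0]])
Z = standardize(X)
```

After this transformation, the canonical weights are comparable across variables because each column has mean zero and unit standard deviation.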

The results can be seen in this table. There are 6 canonical correlations between 6 objective and 12 subjective job features. All of these are statistically significant. However, it is worth noting that even small correlations are statistically significant in large samples. Therefore, I only attempt to interpret the first two correlations and ignore the others because the associations are very weak.

The first canonical correlation is 0.56, which suggests a relatively strong link between the first two canonical variables. The level of net income and job status are the two main components of $U_1$, while $V_1$ is primarily composed of physical conditions at work, such as high risks of accidents, hazardous working conditions, and manual tasks, which have the opposite sign to job satisfaction. Overall, the first pair indicates that individuals with low-prestige jobs and lower income levels are more likely to work in poorer conditions with a high risk of accidents, manual labor, and hazardous environments. They are fairly dissatisfied with their job.

In the second pair, $U_2$ includes the number of working hours, overtime hours, the level of net income, and full-time positions. All of these components have the same sign, which is opposite to that of job status. In general, the canonical variable $U_2$ represents individuals in lower-than-average prestige jobs who are employed in full-time positions, have high levels of net income, and work long hours per week. Their working conditions are likely to involve high stress levels, hazardous environments, and high risks of accidents, which are the main contributors to $V_2$. Concerning their well-being, they are slightly dissatisfied with their job.

Based on the results, 4 out of 8 pairs of canonical variables provide us with insights into working conditions in different socio-demographic groups.

The first pair represents well-educated or well-trained workers. The two canonical variables $U_1$ and $V_1$ are strongly correlated (0.729). The level of education and household income are the main components of $U_1$, which indicates that male and young people who come from affluent families are better educated or better trained. In contrast, the level of personal income, job status, number of working hours, and full-time positions are the main constituents of $V_1$. Overall, the first pair suggests that people with low levels of training and household income tend to have jobs with lower income, part-time contracts, more working hours, and low prestige. They are relatively dissatisfied with their job.

Gender and total part-time experience are the main components of $U_2$. The opposite signs indicate that women tend to accept part-time jobs. The other canonical variable, $V_2$, is mainly composed of full-time positions, job status, the total number of working hours, the level of monthly net personal income, manual labor, risk of accidents, and hazardous conditions. We may conclude that female workers are more likely to have part-time jobs with low income levels, short working time, low risk of accidents, safe working conditions, and few manual tasks. The link is quite strong, with a correlation of 0.67.

The third pair provides insight into working conditions in different age groups. Overall, younger workers have less experience in full-time and part-time positions than older employees. The association between $U_3$ and $V_3$ suggests that these individuals tend to work in positions with lower levels of net income, temporary contracts, more opportunities to participate in training and development programs, and more diverse tasks at work.

The fifth pair is about migrant employees. $U_5$ is dominated by the migration background, suggesting that these individuals tend to be male and older, and to have more years of unemployment and fewer years in full-time jobs than others. Overall, they tend to be relatively satisfied with their job.

K-means clustering

Theory

In this part, I use K-means clustering to split participants into groups based on their personality traits and level of job satisfaction. To understand the intuition behind this technique, let $I = (I_1, \dots, I_N)$ denote the set of individuals in the analysis sample (N = 7,506). Based on $p=10$ clustering variables, the objective is to split $I$ into $K$ clusters $C_1, \dots, C_K$, where $K$ is pre-specified. The final results should satisfy the following conditions:

The key to cluster analysis is to choose a suitable way to measure the degree of similarity between individuals. The most common approach involves the use of Euclidean distance, which helps measure how close one individual is to the cluster center. Ideally, the sum of all distances in each cluster should be as small as possible. This is the notion behind within-cluster variation $W(C_k)$, defined as follows:

$W(C_k) = \frac{1}{|C_k|}\sum\limits_{i,i' \in C_k}\sum\limits_{j = 1}^{p}(x_{ij}-x_{i'j})^2$

where $|C_k|$ represents the number of individuals in the cluster $C_k$ and the inner sum is the squared Euclidean distance between points $i$ and $i'$ over the $p$ variables. Within-cluster variation measures the compactness of a cluster. It allows us to quantify the differences between individuals in a cluster.
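The definition of $W(C_k)$ can be checked on a tiny example. The sketch below evaluates the pairwise formula directly and compares it with an equivalent form, twice the sum of squared distances to the cluster mean, which is what links the definition to the "distance to the cluster center" intuition. The three points and both function names are made up for illustration.

```python
import numpy as np

def within_cluster_variation(X):
    """W(C): sum of squared distances over all ordered pairs (i, i'), divided by |C|."""
    diffs = X[:, None, :] - X[None, :, :]     # shape (n, n, p)
    return (diffs ** 2).sum() / X.shape[0]

def centroid_form(X):
    """Twice the sum of squared distances to the cluster mean."""
    return 2.0 * ((X - X.mean(axis=0)) ** 2).sum()

cluster = np.array([[1.0, 2.0], [2.0, 0.0], [4.0, 3.0]])   # made-up points
w_pairs = within_cluster_variation(cluster)
w_centroid = centroid_form(cluster)
```

For these three points, the squared distances between pairs are 5, 10, and 13, so the ordered-pair sum is 56 and both expressions equal 56/3.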

The objective of K-means clustering is to minimize the total within-cluster variation over all clusters in the sample.

$$\underset{C_1,\dots,C_K}{\text{minimize}}\;\sum\limits_{k = 1}^{K}W(C_k)$$

Several K-means clustering algorithms are available to solve this optimization problem. In this project, I follow the standard approach suggested by Hartigan and Wong (1979).
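To illustrate how such an algorithm proceeds, here is a minimal sketch of Lloyd's algorithm, a simpler K-means variant. It is not the Hartigan and Wong (1979) procedure used in this project; the function names and the synthetic data are assumptions for illustration.

```python
import numpy as np

def nearest(X, centers):
    """Index of the closest center (squared Euclidean distance) for each row of X."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans_lloyd(X, k, n_iter=100, seed=0):
    """Lloyd iterations: assign points to the nearest center, then recompute means."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = nearest(X, centers)
        # Move each center to the mean of its members; keep it if the cluster is empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return nearest(X, centers), centers

# Synthetic data: three made-up groups in two dimensions.
rng = np.random.default_rng(42)
blobs = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.3, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.3, size=(50, 2)),
])
labels, centers = kmeans_lloyd(blobs, k=3)

# Total within-cluster sum of squares around the fitted centers.
total_wss = ((blobs - centers[labels]) ** 2).sum()
```

Because the result depends on the random starting centers, implementations are commonly run from several starts and the solution with the smallest total within-cluster variation is kept.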

Implementation

Overall, three well-separated groups are determined.

Cluster 1 includes individuals with many qualities that are desirable in employees. They are very sociable and less reserved, very imaginative, friendly, and able to forgive other people easily. They work efficiently and thoroughly, deal with stress well, and worry less. These individuals are relatively satisfied with their occupation.

Cluster 2 appears to be a weaker version of cluster 1. Its members have many positive features, namely efficiency, thoroughness, forgiveness, and friendliness, but the cluster means are lower than those in the first group. Conversely, the signs of the neuroticism and openness variables contrast with those in group 1. This indicates that they are not so imaginative, worry a lot, and cannot cope with stress well. Group 2 also scores low on extraversion, as its members are less sociable and more reserved. The level of job satisfaction is low in this cluster.

Cluster 3 represents individuals with almost the opposite personality traits to those in the first group. They work very inefficiently and are not good at organizing tasks, are not very imaginative, are very unfriendly and unwilling to show affection to others, do not worry a lot about work-related matters, and lack the ability to handle stress. Overall, this group includes respondents with negative features and a relatively low level of job satisfaction.

The following image gives a clearer picture of these three distinct groups. The three clusters are separated quite well, though there is still some overlap between them.

[Figure: cluster.png]

Conclusion